Abstract. We raise the question of approximating the compressibility of a string with respect to a
fixed compression scheme, in sublinear time. We study this question in detail for two popular lossless
compression schemes: run-length encoding (RLE) and Lempel-Ziv (LZ), and present sublinear
algorithms for approximating compressibility with respect to both schemes. We also give several lower
bounds that show that our algorithms for both schemes cannot be improved significantly.
Our investigation of LZ yields results whose interest goes beyond the initial questions we set out to
study. In particular, we prove combinatorial structural lemmas that relate the compressibility of a string
with respect to Lempel-Ziv to the number of distinct short substrings contained in it. In addition, we
show that approximating the compressibility with respect to LZ is related to approximating the support
size of a distribution.

1 Introduction

Given an extremely long string, it is natural to wonder how compressible it is. This fundamental
question is of interest to a wide range of areas of study, including computational
complexity theory, machine learning, storage systems, and communications. As massive data
sets are now commonplace, the ability to estimate their compressibility with extremely efficient,
even sublinear time, algorithms, is gaining in importance. The most general measure
of compressibility, Kolmogorov complexity, is not computable (see [14] for a textbook treatment),
nor even approximable. Even under restrictions which make it computable (such as a
bound on the running time of decompression), it is probably hard to approximate in polynomial
time, since an approximation would allow distinguishing random from pseudorandom
strings and, hence, inverting one-way functions. However, the question of how compressible
a large string is with respect to a specific compression scheme may be tractable, depending
on the particular scheme.

We raise the question of approximating the compressibility of a string with respect to a 
fixed compression scheme, in sublinear time, and give algorithms and nearly matching lower 
bounds for several versions of the problem. While this question is new, for one compression 
scheme, answers follow from previous work. Namely, compressibility under Huffman encoding 
is determined by the entropy of the symbol frequencies. Batu et al. [3] and Brautbar and
Samorodnitsky [5] study the problem of approximating the entropy of a distribution from a
small number of samples, and their results immediately imply algorithms and lower bounds 
for approximating compressibility under Huffman encoding. 

In this work we study the compressibility approximation question in detail for two popular 
lossless compression schemes: run-length encoding (RLE) and Lempel-Ziv (LZ) [18]. In the 
RLE scheme, each run, or a sequence of consecutive occurrences of the same character, is 
stored as a pair: the character, and the length of the run. Run-length encoding is used to 
compress black and white images, faxes, and other simple graphic images, such as icons and 
line drawings, which usually contain many long runs. In the LZ scheme^, a left-to-right pass
of the input string is performed and at each step, the longest sequence of characters that 
has started in the previous portion of the string is replaced with the pointer to the previous 
location and the length of the sequence (for a formal definition, see Section 4). The LZ 
scheme and its variants have been studied extensively in machine learning and information 
theory, in part because they compress strings generated by an ergodic source to the shortest 
possible representation (given by the entropy) in the asymptotic limit (cf. [10]). Many popular 
archivers, such as gzip, use variations on the LZ scheme. In this work we present sublinear 
algorithms and corresponding lower bounds for approximating compressibility with respect 
to both schemes, RLE and LZ. 

Motivation. Computing the compressibility of a large string with respect to specific com- 
pression schemes may be done in order to decide whether or not to compress the file, to 
choose which compression method is the most suitable, or to check whether a small modification
to the file (e.g., a rotation of an image) will make it significantly more compressible.
Moreover, compression schemes are used as tools for measuring properties of strings such as 
similarity and entropy. As such, they are applied widely in data-mining, natural language 
processing and genomics (see, for example, Lowenstern et al. [15], Kukushkina et al. [11], 
Benedetto et al. [4], Li et al. [13] and Cilibrasi and Vitányi [8,9]). In these applications, one
typically needs only the length of the compressed version of a file, not the output itself. For
example, in the clustering algorithm of [8], the distance between two objects $x$ and $y$ is given
by a normalized version of the length of their compressed concatenation $x \| y$. The algorithm
first computes all pairwise distances, and then analyzes the resulting distance matrix. This
requires $O(t^2)$ runs of a compression scheme, such as gzip, to cluster $t$ objects. Even a weak
approximation algorithm that can quickly rule out very incompressible strings would reduce 
the running time of the clustering computations dramatically. 

Multiplicative and Additive Approximations. We consider three approximation notions: additive,
multiplicative, and the combination of additive and multiplicative. On an input of
length $n$, the quantities we approximate range from 1 to $n$. An additive approximation
algorithm is allowed an additive error of $\epsilon n$, where $\epsilon \in (0, 1)$ is a parameter. The output of a
multiplicative approximation algorithm is within a factor $A > 1$ of the correct answer. The
combined notion allows both types of error: the algorithm should output an estimate $\hat{C}$ of
the compression cost $C$ such that $\frac{C}{A} - \epsilon n \le \hat{C} \le A \cdot C + \epsilon n$. Our algorithms are randomized,
and for all inputs the approximation guarantees hold with probability at least $\frac{2}{3}$.

^ We study the variant known as LZ77 [18], which achieves the best compressibility. There are several other variants

We are interested in sublinear approximation algorithms, which read few positions of the 
input strings. For the schemes we study, purely multiplicative approximation algorithms must 
read almost the entire input. Nevertheless, algorithms with additive error guarantees, or a 
possibility of both multiplicative and additive error are often sufficient for distinguishing very 
compressible inputs from inputs that are not well compressible. For both the RLE and LZ 
schemes, we give algorithms with combined multiplicative and additive error that make few 
queries to the input. When it comes to additive approximations, however, the two schemes 
differ sharply: sublinear additive approximations are possible for the RLE compressibility, 
but not for LZ compressibility. 

1.1 Results for Run-Length Encoding 

For RLE, we present sublinear algorithms for all three approximation notions defined above,
providing a trade-off between the quality of approximation and the running time. The algorithms
that allow an additive approximation run in time independent of the input size.
Specifically, an $\epsilon n$-additive estimate can be obtained in time^ $\tilde{O}(1/\epsilon^3)$, and a combined
estimate, with a multiplicative error of 3 and an additive error of $\epsilon n$, can be obtained in
time $\tilde{O}(1/\epsilon)$. As for a strict multiplicative approximation, we give a simple 4-multiplicative
approximation algorithm that runs in expected time $\tilde{O}\left(\frac{n}{C_{\mathrm{rle}}(w)}\right)$, where $C_{\mathrm{rle}}(w)$ denotes the
compression cost of the string $w$. For any $\gamma > 0$, the multiplicative error can be improved to
$1 + \gamma$ at the cost of multiplying the running time by $\mathrm{poly}(1/\gamma)$. Observe that the algorithm is
more efficient when the string is less compressible, and less efficient when the string is more
compressible. One of our lower bounds justifies such a behavior and, in particular, shows that
a constant factor approximation requires linear time for strings that are very compressible.
We also give a lower bound of $\Omega(1/\epsilon^2)$ for $\epsilon n$-additive approximation.

1.2 Results for Lempel-Ziv 

We prove that approximating compressibility with respect to LZ is closely related to the
following problem, which we call COLORS: Given access to a string $\tau$ of length $n$ over
an alphabet, approximate the number of distinct symbols ("colors") in $\tau$. This is essentially
equivalent to estimating the support size of a distribution [17]. Variants of this problem
have been considered under various guises in the literature: in databases it is referred to as
approximating distinct values (Charikar et al. [7]), in statistics as estimating the number
of species in a population (see the over 800 references maintained by Bunge [6]), and in
streaming as approximating the frequency moment $F_0$ (Alon et al. [1], Bar-Yossef et al. [2]).
Most of these works, however, consider models different from ours. For our model, there is
an $A$-multiplicative approximation algorithm of [7] that runs in time $O\left(\frac{n}{A^2}\right)$, matching the
lower bound in [7, 2]. There is also an almost linear lower bound for approximating COLORS
with additive error [17].

We give a reduction from LZ compressibility to COLORS and vice versa. These reductions
allow us to employ the known results on COLORS to give algorithms and lower bounds for this
problem. Our approximation algorithm for LZ compressibility combines a multiplicative and
additive error. The running time of the algorithm is $\tilde{O}\left(\frac{n}{A^3 \epsilon}\right)$, where $A$ is the multiplicative
error and $\epsilon n$ is the additive error. In particular, this implies that for any $\alpha > 0$, we can
distinguish, in sublinear time $\tilde{O}(n^{1-\alpha})$, strings compressible to $O(n^{1-\alpha})$ symbols from strings
only compressible to $\Omega(n)$ symbols.^

^ The notation $\tilde{O}(g(k))$ for a function $g$ of a parameter $k$ means $O(g(k) \cdot \mathrm{polylog}(g(k)))$ where $\mathrm{polylog}(g(k)) =
\log^c(g(k))$ for some constant $c$.

The main tool in the algorithm consists of two combinatorial structural lemmas that 
relate compressibility of the string to the number of distinct short substrings contained in 
it. Roughly, they say that a string is well compressible with respect to LZ if and only if it 
contains few distinct substrings of length $\ell$ for all small $\ell$ (when considering all $n - \ell + 1$
possible overlapping substrings). The simpler of the two lemmas was inspired by a structural
lemma for grammars by Lehman and Shelat [12]. The combinatorial lemmas allow us to 
establish a reduction from LZ compressibility to Colors and employ a (simple) algorithm 
for approximating COLORS in our algorithm for LZ. 

Interestingly, we can show that there is also a reduction in the opposite direction: namely, 
approximating COLORS reduces to approximating LZ compressibility. The lower bound 
of [17], combined with the reduction from COLORS to LZ, implies that our algorithm for 
LZ cannot be improved significantly. In particular, our lower bound implies that for any
$B = n^{o(1)}$, distinguishing strings compressible by LZ to $O(n/B)$ symbols from strings
compressible to $\Omega(n)$ symbols requires $n^{1-o(1)}$ queries.

1.3 Further Research

It would be interesting to extend our results for estimating the compressibility under LZ77 
to other variants of LZ, such as dictionary-based LZ78 [19]. Compressibility under LZ78 can 
be drastically different from compressibility under LZ77: e.g., for $0^n$ they differ roughly by a
factor of $\sqrt{n}$. Another open question is approximating compressibility for schemes other than
RLE and LZ. In particular, it would be interesting to design approximation algorithms for
lossy compression schemes such as JPEG, MPEG and MP3. One lossy compression scheme
to which our results extend directly is Lossy RLE, where some characters, e.g., the ones that 
represent similar colors, are treated as the same character. 

1.4 Organization 

We start with some definitions in Section 2. Section 3 contains our results for RLE. Section 4 
deals with the LZ scheme. All missing details (descriptions of algorithms and proofs of claims) 
can be found in [16]. 

2 Preliminaries 

The input to our algorithms is usually a string $w$ of length $n$ over a finite alphabet $\Sigma$. The
quantities we approximate, such as compression cost of $w$ under a specific algorithm, range
from 1 to $n$. We consider estimates to these quantities that have both multiplicative and
additive error. We call $\hat{C}$ an $(A, \epsilon)$-estimate for $C$ if $\frac{C}{A} - \epsilon n \le \hat{C} \le A \cdot C + \epsilon n$, and say an
algorithm $(A, \epsilon)$-estimates $C$ (or is an $(A, \epsilon)$-approximation algorithm for $C$) if for each input
it produces an $(A, \epsilon)$-estimate for $C$ with probability at least $\frac{2}{3}$.

^ To see this, set $A = o(n^{\alpha/2})$ and $\epsilon = o(n^{-\alpha/2})$.

When the error is purely additive or multiplicative, we use the following shorthand:
$\epsilon n$-additive estimate stands for $(1, \epsilon)$-estimate and $A$-multiplicative estimate, or $A$-estimate,
stands for $(A, 0)$-estimate. An algorithm computing an $\epsilon n$-additive estimate with probability
at least $\frac{2}{3}$ is an $\epsilon n$-additive approximation algorithm, and if it computes an $A$-multiplicative
estimate then it is an $A$-multiplicative approximation algorithm, or $A$-approximation algorithm.

For some settings of parameters, obtaining a valid estimate is trivial. For a quantity in
$[1, n]$, for example, $\frac{n}{2}$ is an $\frac{n}{2}$-additive estimate, $\sqrt{n}$ is a $\sqrt{n}$-estimate and $\epsilon n$ is an $(A, \epsilon)$-estimate
whenever $A \ge \frac{1}{2\epsilon}$.
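The definition of an $(A, \epsilon)$-estimate and the last trivial example can be checked mechanically. The following is a minimal sketch; the helper names `is_estimate` and `trivial_estimate_is_valid` are illustrative and not from the paper, and exact rational arithmetic is used only so that the boundary case $A = \frac{1}{2\epsilon}$ is not disturbed by rounding.

```python
from fractions import Fraction


def is_estimate(c_hat, c, a, eps, n):
    """True if c_hat is an (A, eps)-estimate for c: c/A - eps*n <= c_hat <= A*c + eps*n."""
    return c / a - eps * n <= c_hat <= a * c + eps * n


def trivial_estimate_is_valid(n, eps, a):
    """Check that the fixed guess eps*n is an (A, eps)-estimate for every value in [1, n]."""
    return all(is_estimate(eps * n, c, a, eps, n) for c in range(1, n + 1))


n = 1000
eps = Fraction(1, 20)
# A = 1/(2*eps) = 10 is exactly the threshold at which the fixed guess works.
print(trivial_estimate_is_valid(n, eps, 1 / (2 * eps)))
# A = 5 is below the threshold, so the guess fails for large true values.
print(trivial_estimate_is_valid(n, eps, Fraction(5)))
```

The second call fails because, for a true value close to $n$, the lower bound $\frac{C}{A} - \epsilon n$ exceeds $\epsilon n$ once $A < \frac{1}{2\epsilon}$.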



3 Run-Length Encoding 

Every $n$-character string $w$ over alphabet $\Sigma$ can be partitioned into maximal runs of identical
characters of the form $\sigma^\ell$, where $\sigma$ is a symbol in $\Sigma$ and $\ell$ is the length of the run, and
consecutive runs are composed of different symbols. In the Run-Length Encoding of $w$, each
such run is replaced by the pair $(\sigma, \ell)$. The number of bits needed to represent such a pair
is $\lceil \log(\ell + 1) \rceil + \lceil \log |\Sigma| \rceil$ plus the overhead which depends on how the separation between
the characters and the lengths is implemented. One way to implement it is to use prefix-free
encoding for lengths. For simplicity we ignore the overhead in the above expression, but our
analysis can be adapted to any implementation choice. The cost of the run-length encoding,
denoted by $C_{\mathrm{rle}}(w)$, is the sum over all runs of $\lceil \log(\ell + 1) \rceil + \lceil \log |\Sigma| \rceil$.
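As a concrete reading of this definition, the cost can be computed exactly by one linear pass over the string. This is a minimal sketch assuming logarithms are base 2 and the overhead is ignored, as in the text; the function names are illustrative.

```python
from itertools import groupby


def ceil_log2(x):
    """Exact ceil(log2(x)) for an integer x >= 1, without floating point."""
    return (x - 1).bit_length()


def runs(w):
    """Maximal runs of w as (symbol, length) pairs."""
    return [(sym, sum(1 for _ in grp)) for sym, grp in groupby(w)]


def rle_cost(w, alphabet_size):
    """Sum over all runs of ceil(log(l + 1)) + ceil(log |Sigma|)."""
    sym_bits = ceil_log2(alphabet_size)
    return sum(ceil_log2(length + 1) + sym_bits for _, length in runs(w))


# Runs a^3 b^1 c^2 d^4 over a 4-letter alphabet cost (2+2) + (1+2) + (2+2) + (3+2) bits.
print(rle_cost("aaabccdddd", 4))
```

This exact computation reads all $n$ positions, which is the baseline the sublinear algorithms below are meant to avoid.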



3.1 An $\epsilon n$-Additive Estimate with $\tilde{O}(1/\epsilon^3)$ Queries

Our first algorithm for approximating the cost of RLE is very simple: it samples a few
positions in the input string uniformly at random and bounds the lengths of the runs to
which they belong by looking at the positions to the left and to the right of each sample. If
the corresponding run is short, its length is established exactly; if it is long, we argue that it
does not contribute much to the encoding cost. For each index $t \in [n]$, let $\ell(t)$ be the length
of the run to which $w_t$ belongs. The cost contribution of index $t$ is defined as

$$c(t) = \frac{\lceil \log(\ell(t) + 1) \rceil + \lceil \log |\Sigma| \rceil}{\ell(t)}.$$

By definition, $\frac{C_{\mathrm{rle}}(w)}{n} = \mathbb{E}_{t \in [n]}[c(t)]$, where $\mathbb{E}_{t \in [n]}$ denotes expectation over a uniformly random
choice of $t$. The algorithm, presented below, estimates the encoding cost by the average of
the cost contributions of the sampled short runs, multiplied by $n$.

Algorithm I: An $\epsilon n$-Additive Approximation for $C_{\mathrm{rle}}(w)$

1. Select $q = \Theta(1/\epsilon^2)$ indices $t_1, \ldots, t_q$ uniformly and independently at random.

2. For each $i \in [q]$:

(a) Query $t_i$ and up to $\ell_0 = \frac{8 \log(4|\Sigma|/\epsilon)}{\epsilon}$ positions in its vicinity to bound $\ell(t_i)$.

(b) Set $\hat{c}(t_i) = c(t_i)$ if $\ell(t_i) < \ell_0$ and $\hat{c}(t_i) = 0$ otherwise.

3. Output $\hat{C}_{\mathrm{rle}} = n \cdot \mathbb{E}_{i \in [q]}[\hat{c}(t_i)]$.



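Correctness. We first prove that the algorithm is an $\epsilon n$-additive approximation. The error
of the algorithm comes from two sources: from ignoring the contribution of long runs and
from sampling. The ignored indices $t$, for which $\ell(t) \ge \ell_0$, do not contribute much to the
cost. Since the cost assigned to the indices monotonically decreases with the length of the
run to which they belong, for each such index,

$$c(t) \le \frac{\lceil \log(\ell_0 + 1) \rceil + \lceil \log |\Sigma| \rceil}{\ell_0} \le \frac{\epsilon}{2}.$$

Therefore,

$$\frac{C_{\mathrm{rle}}(w)}{n} - \frac{\epsilon}{2} \le \frac{1}{n} \cdot \sum_{t : \ell(t) < \ell_0} c(t) \le \frac{C_{\mathrm{rle}}(w)}{n}.$$

Equivalently, $\frac{C_{\mathrm{rle}}(w)}{n} - \frac{\epsilon}{2} \le \mathbb{E}_{t \in [n]}[\hat{c}(t)] \le \frac{C_{\mathrm{rle}}(w)}{n}$.

By an additive Chernoff bound, with high constant probability, the sampling error in
estimating $\mathbb{E}[\hat{c}(t_i)]$ is at most $\epsilon/2$. Therefore, $\hat{C}_{\mathrm{rle}}$ is an $\epsilon n$-additive estimate of $C_{\mathrm{rle}}(w)$, as
desired.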

Query and time complexity. (Assuming $|\Sigma|$ is constant.) Since the number of queries
performed for each selected $t_i$ is $O(\ell_0) = O(\log(1/\epsilon)/\epsilon)$, the total number of queries, as well
as the running time, is $O(\log(1/\epsilon)/\epsilon^3)$.
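The sampling scheme of this subsection can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's exact procedure: logarithms are base 2, the number of samples `q` is passed in by the caller instead of being derived from $\epsilon$, and the threshold is chosen by a hypothetical helper `choose_l0` as the smallest value for which a smooth upper bound on $c(t)$ drops to $\epsilon/2$, instead of using a closed-form $\ell_0$.

```python
import math
import random
from itertools import groupby


def ceil_log2(x):
    """Exact ceil(log2(x)) for an integer x >= 1."""
    return (x - 1).bit_length()


def exact_rle_cost(w, alphabet_size):
    """Reference value: reads the whole string."""
    sym_bits = ceil_log2(alphabet_size)
    return sum(ceil_log2(sum(1 for _ in g) + 1) + sym_bits for _, g in groupby(w))


def choose_l0(eps, sym_bits):
    """Smallest l0 with (log2(l0 + 1) + 1 + sym_bits) / l0 <= eps / 2.

    The left side upper-bounds c(t) and is decreasing in the run length,
    so every index in a run of length >= l0 contributes at most eps / 2.
    """
    l0 = 1
    while (math.log2(l0 + 1) + 1 + sym_bits) / l0 > eps / 2:
        l0 += 1
    return l0


def estimate_rle_cost(w, alphabet_size, eps, q, rng):
    """Average the cost contributions of sampled indices lying in short runs."""
    n = len(w)
    sym_bits = ceil_log2(alphabet_size)
    l0 = choose_l0(eps, sym_bits)
    total = 0.0
    for _ in range(q):
        t = rng.randrange(n)
        lo = hi = t
        # Scan left, then right, stopping as soon as l0 equal symbols are known.
        while hi - lo + 1 < l0 and lo > 0 and w[lo - 1] == w[t]:
            lo -= 1
        while hi - lo + 1 < l0 and hi + 1 < n and w[hi + 1] == w[t]:
            hi += 1
        length = hi - lo + 1
        if length < l0:  # both scans stopped at a run boundary, so length is exact
            total += (ceil_log2(length + 1) + sym_bits) / length
    return n * total / q


w = ("0" + "11" + "0" * 50 + "111") * 40
estimate = estimate_rle_cost(w, 2, eps=0.2, q=4000, rng=random.Random(0))
print(estimate, exact_rle_cost(w, 2))
```

Each sample touches at most about $\ell_0$ positions, so the number of queries depends on $\epsilon$ and `q` but not on the length of `w`.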

3.2 Summary of Positive Results on RLE 

After stating Theorem 1 that summarizes our positive results, we briefly discuss some of the 
ideas used in the algorithms omitted from this version of the paper. 

Theorem 1 Let $w \in \Sigma^n$ be a string to which we are given query access.

1. Algorithm I gives $\epsilon n$-additive approximation to $C_{\mathrm{rle}}(w)$ in time $\tilde{O}(1/\epsilon^3)$.

2. $C_{\mathrm{rle}}(w)$ can be $(3, \epsilon)$-estimated in time $\tilde{O}(1/\epsilon)$.

3. $C_{\mathrm{rle}}(w)$ can be 4-estimated in expected time $\tilde{O}\left(\frac{n}{C_{\mathrm{rle}}(w)}\right)$. A $(1 + \gamma)$-estimate of $C_{\mathrm{rle}}(w)$
can be obtained in expected time $\tilde{O}\left(\frac{n}{C_{\mathrm{rle}}(w)} \cdot \mathrm{poly}(1/\gamma)\right)$. The algorithm needs no prior
knowledge of $C_{\mathrm{rle}}(w)$.

Section 3.1 gives a complete proof of Item 1. The algorithm in Item 2 partitions the
positions in the string into buckets according to the length of the runs they belong to. It
estimates the sizes of different buckets with different precision, depending on the size of
the bucket and the length of the runs it contains. The main idea in Item 3 is to search for
$C_{\mathrm{rle}}(w)$, using the algorithm from Item 2 repeatedly (with different parameters) to establish
successively better estimates.



3.3 Lower Bounds for RLE 

We give two lower bounds, for multiplicative and additive approximation, respectively, which 
establish that the running times in Items 1 and 3 of Theorem 1 are essentially tight. 

Theorem 2 1. For all $A > 1$, any $A$-approximation algorithm for $C_{\mathrm{rle}}$ requires $\Omega\left(\frac{n}{A^2 \log n}\right)$
queries. Furthermore, if the input is restricted to strings with compression cost $C_{\mathrm{rle}}(w) \ge C$,
then $\Omega\left(\frac{n}{C A^2 \log n}\right)$ queries are necessary.

2. For all $\epsilon \in (0, \frac{1}{2})$, any $\epsilon n$-additive approximation algorithm for $C_{\mathrm{rle}}$ requires $\Omega(1/\epsilon^2)$
queries.

A Multiplicative Lower Bound (Proof of Theorem 2, Item 1): The claim follows from the
next lemma:

Lemma 3 For every $n \ge 2$ and every integer $1 \le k \le n/2$, there exists a family of strings,
denoted $W_k$, for which the following holds: (1) $C_{\mathrm{rle}}(w) = \Theta\left(k \log\left(\frac{n}{k}\right)\right)$ for every $w \in W_k$; (2)
Distinguishing a uniformly random string in $W_k$ from one in $W_{k'}$, where $k' > k$, requires
$\Omega\left(\frac{n}{k'}\right)$ queries.

Proof: Let $\Sigma = \{0, 1\}$ and assume for simplicity that $n$ is divisible by $k$. Every string
in $W_k$ consists of $k$ blocks, each of length $\frac{n}{k}$. Every odd block contains only 1s and every
even block contains a single 0. The strings in $W_k$ differ in the locations of the 0s within
the even blocks. Every $w \in W_k$ contains $k/2$ isolated 0s and $k/2$ runs of 1s, each of length
$\Theta\left(\frac{n}{k}\right)$. Therefore, $C_{\mathrm{rle}}(w) = \Theta\left(k \log\left(\frac{n}{k}\right)\right)$. To distinguish a random string in $W_k$ from one
in $W_{k'}$ with probability 2/3, one must make $\Omega\left(\frac{n}{\max(k, k')}\right)$ queries since, in both cases, with
asymptotically fewer queries the algorithm sees only 1s with high probability. ■

Additive Lower Bound (Proof of Theorem 2, Item 2): For any $p \in [0, 1]$ and sufficiently large $n$,
let $\mathcal{D}_{n,p}$ be the following distribution over $n$-bit strings. For simplicity, consider $n$ divisible
by 3. The string is determined by $\frac{n}{3}$ independent coin flips, each with bias $p$. Each "heads"
extends the string by three runs of length 1, and each "tails", by a run of length 3. Given
the sequence of run lengths, dictated by the coin flips, output the unique binary string that
starts with 0 and has this sequence of run lengths.^

Let $W$ be a random variable drawn according to $\mathcal{D}_{n,1/2}$ and $W'$, according to $\mathcal{D}_{n,1/2+\epsilon}$.
The following facts are established in the full version [16]: (a) $\Omega(1/\epsilon^2)$ queries are necessary
to reliably distinguish $W$ from $W'$, and (b) With high probability, the encoding costs of $W$
and $W'$ differ by $\Omega(\epsilon n)$. Together these facts imply the lower bound. ■
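To make the hard distribution concrete, the sampling process can be sketched as follows. This is a minimal illustration; the function name `sample_string` and the use of Python's `random.Random` as the coin source are assumptions made for the example.

```python
import random
from itertools import groupby


def sample_string(n, p, rng):
    """Draw an n-bit string: n/3 coin flips with bias p, heads -> runs 1,1,1, tails -> run 3."""
    if n % 3 != 0:
        raise ValueError("n must be divisible by 3")
    run_lengths = []
    for _ in range(n // 3):
        if rng.random() < p:  # heads
            run_lengths.extend([1, 1, 1])
        else:  # tails
            run_lengths.append(3)
    pieces, symbol = [], "0"  # the string starts with 0
    for length in run_lengths:
        pieces.append(symbol * length)
        symbol = "1" if symbol == "0" else "0"  # consecutive runs alternate
    return "".join(pieces)


rng = random.Random(1)
w = sample_string(30, 0.5, rng)
print(w, [len(list(g)) for _, g in groupby(w)])
```

A higher bias produces more runs of length 1, which are more expensive per position, so the encoding cost grows with $p$ while any single query reveals little about $p$.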

^ Let $b_i$ be a boolean variable representing the outcome of the $i$th coin. Then the output is $0 b_1 0 1 b_2 1 0 b_3 0 1 b_4 1 \cdots$

4 Lempel Ziv Compression

In this section we consider a variant of Lempel and Ziv's compression algorithm [18], which
we refer to as LZ77. In all that follows we use the shorthand $[n]$ for $\{1, \ldots, n\}$. Let $w \in \Sigma^n$
be a string over an alphabet $\Sigma$. Each symbol of the compressed representation of $w$, denoted
$LZ(w)$, is either a character $\sigma \in \Sigma$ or a pair $(p, \ell)$ where $p \in [n]$ is a pointer (index)
to a location in the string $w$ and $\ell$ is the length of the substring of $w$ that this symbol
represents. To compress $w$, the algorithm works as follows. Starting from $t = 1$, at each step
the algorithm finds the longest substring $w_t \ldots w_{t+\ell-1}$ for which there exists an index $p < t$,
such that $w_p \ldots w_{p+\ell-1} = w_t \ldots w_{t+\ell-1}$. (The substrings $w_p \ldots w_{p+\ell-1}$ and $w_t \ldots w_{t+\ell-1}$ may
overlap.) If there is no such substring (that is, the character $w_t$ has not appeared before)
then the next symbol in $LZ(w)$ is $w_t$, and $t = t + 1$. Otherwise, the next symbol is $(p, \ell)$
and $t = t + \ell$. We refer to the substring $w_t \ldots w_{t+\ell-1}$ (or $w_t$ when $w_t$ is a new character) as
a compressed segment.
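The parsing rule just described can be written down directly. The sketch below is a deliberately naive implementation meant only to pin down the definition: it tries every earlier start position, so it is far from linear time, and the function name `lz77_parse` is illustrative. Pointers are reported 1-based to match the text, and ties between equally long matches are broken in favor of the earliest position, a choice the text leaves open.

```python
def lz77_parse(w):
    """Return LZ(w) as a list whose items are characters or (pointer, length) pairs."""
    symbols = []
    n = len(w)
    t = 0  # 0-based position of the next uncompressed character
    while t < n:
        best_len, best_p = 0, -1
        for p in range(t):  # candidate earlier start positions
            length = 0
            # The match may run past position t, i.e. source and target may overlap.
            while t + length < n and w[p + length] == w[t + length]:
                length += 1
            if length > best_len:
                best_len, best_p = length, p
        if best_len == 0:
            symbols.append(w[t])  # new character, copied as is
            t += 1
        else:
            symbols.append((best_p + 1, best_len))
            t += best_len
    return symbols


print(lz77_parse("aaaa"))        # overlap lets one pair cover the last three characters
print(lz77_parse("abcabcabcd"))
```

The number of items in the returned list is the quantity that the rest of this section estimates.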

Let $C_{\mathrm{LZ}}(w)$ denote the number of symbols in the compressed string $LZ(w)$. (We do not
distinguish between symbols that are characters in $\Sigma$ and symbols that are pairs $(p, \ell)$.)
Given query access to a string $w \in \Sigma^n$, we are interested in computing an estimate $\hat{C}_{\mathrm{LZ}}$ of
$C_{\mathrm{LZ}}(w)$. As we shall see, this task reduces to estimating the number of distinct substrings in
$w$ of different lengths, which in turn reduces to estimating the number of distinct characters
("colors") in a string. The actual length of the binary representation of the compressed
substring is at most a factor of $2 \log n$ larger than $C_{\mathrm{LZ}}(w)$. This is relatively negligible given
the quality of the estimates that we can achieve in sublinear time.

We begin by relating LZ compressibility to COLORS (§4.1), then use this relation to
discuss algorithms (§4.2) and lower bounds (§4.3) for compressibility.

4.1 Structural Lemmas 

Our algorithm for approximating the compressibility of an input string with respect to LZ77
uses an approximation algorithm for COLORS (defined in the introduction) as a subroutine.
The main tool in the reduction from LZ77 to COLORS is the relation between $C_{\mathrm{LZ}}(w)$ and
the number of distinct substrings in $w$, formalized in the two structural lemmas. In what
follows, $d_\ell(w)$ denotes the number of distinct substrings of length $\ell$ in $w$. Unlike compressed
segments in $w$, which are disjoint, these substrings may overlap.

Lemma 4 (Structural Lemma 1) For every $\ell \in [n]$, $C_{\mathrm{LZ}}(w) \ge \frac{d_\ell(w)}{\ell}$.

Lemma 5 (Structural Lemma 2) Let $\ell_0 \in [n]$. Suppose that for some integer $m$ and for
every $\ell \in [\ell_0]$, $d_\ell(w) \le m \cdot \ell$. Then $C_{\mathrm{LZ}}(w) \le 4(m \log \ell_0 + n/\ell_0)$.

Proof of Lemma 4. This proof is similar to the proof of a related lemma concerning
grammars from [12]. First note that the lemma holds for $\ell = 1$, since each character $w_t$ in
$w$ that has not appeared previously (that is, $w_{t'} \ne w_t$ for every $t' < t$) is copied by the
compression algorithm to $LZ(w)$.

For the general case, fix $\ell > 1$. Recall that $w_t \ldots w_{t+k-1}$ of $w$ is a compressed segment
if it is represented by one symbol $(p, k)$ in $LZ(w)$. Any substring of length $\ell$ that occurs
within a compressed segment must have occurred previously in the string. Such substrings
can be ignored for our purposes: the number of distinct length-$\ell$ substrings is bounded above
by the number of length-$\ell$ substrings that start inside one compressed segment and end
in another. Each segment (except the last) contributes $(\ell - 1)$ such substrings. Therefore,
$d_\ell(w) \le (C_{\mathrm{LZ}}(w) - 1)(\ell - 1) \le C_{\mathrm{LZ}}(w) \cdot \ell$ for every $\ell > 1$. ■
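Lemma 4 is easy to test numerically on small strings. The sketch below recomputes $C_{\mathrm{LZ}}(w)$ with a naive greedy parse and $d_\ell(w)$ by brute force, then checks $d_\ell(w) \le \ell \cdot C_{\mathrm{LZ}}(w)$ for every $\ell$; the helper names are illustrative and the code is intended only for short inputs.

```python
import random


def lz77_cost(w):
    """Number of symbols in the greedy LZ77 parse (naive search over earlier positions)."""
    n, t, cost = len(w), 0, 0
    while t < n:
        best = 0
        for p in range(t):
            length = 0
            while t + length < n and w[p + length] == w[t + length]:
                length += 1
            best = max(best, length)
        t += max(best, 1)  # a new character advances by one position
        cost += 1
    return cost


def distinct_substrings(w, length):
    """d_l(w): number of distinct (possibly overlapping) substrings of the given length."""
    return len({w[i:i + length] for i in range(len(w) - length + 1)})


def lemma4_holds(w):
    """Check d_l(w) <= l * C_LZ(w) for every l in [n]."""
    cost = lz77_cost(w)
    return all(distinct_substrings(w, l) <= l * cost for l in range(1, len(w) + 1))


rng = random.Random(3)
samples = ["abracadabra", "a" * 20, "".join(rng.choice("ab") for _ in range(60))]
print([(lz77_cost(s), lemma4_holds(s)) for s in samples])
```

Read in the other direction, the inequality says that a single length $\ell$ with many distinct substrings already certifies that the string is not very compressible.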

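Proof of Lemma 5. Let $n_\ell(w)$ denote the number of compressed segments of length $\ell$ in
$w$, not including the last compressed segment. We use the shorthand $n_\ell$ for $n_\ell(w)$ and $d_\ell$ for
$d_\ell(w)$. In order to prove the lemma we shall show that for every $1 \le \ell \le \lfloor \ell_0/2 \rfloor$,

$$\sum_{k=1}^{\ell} n_k \le 2(m+1) \cdot \sum_{k=1}^{\ell} \frac{1}{k}. \qquad (4)$$

For all $\ell \ge 1$, since the compressed segments in $w$ are disjoint, $\sum_{k=\ell+1}^{n} n_k \le \frac{n}{\ell+1}$. If we
substitute $\ell = \lfloor \ell_0/2 \rfloor$ in the last two equations and sum them up, we get:

$$\sum_{k=1}^{n} n_k \le 2(m+1) \cdot \sum_{k=1}^{\lfloor \ell_0/2 \rfloor} \frac{1}{k} + \frac{2n}{\ell_0} \le 2(m+1)(\ln \ell_0 + 1) + \frac{2n}{\ell_0}. \qquad (5)$$

Since $C_{\mathrm{LZ}}(w) = \sum_{k=1}^{n} n_k + 1$, the lemma follows.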

It remains to prove Equation (4). We do so below by induction on $\ell$, using the following
claim.

Claim 6 For every $1 \le \ell \le \lfloor \ell_0/2 \rfloor$, $\sum_{k=1}^{\ell} k \cdot n_k \le 2\ell(m + 1)$.

Proof: We show that each position $j \in \{\ell, \ldots, n - \ell\}$ that participates in a compressed
substring of length at most $\ell$ in $w$ can be mapped to a distinct length-$2\ell$ substring of $w$. Since
$\ell \le \ell_0/2$, by the premise of the lemma, there are at most $2\ell \cdot m$ distinct length-$2\ell$ substrings.
In addition, the first $\ell - 1$ and the last $\ell$ positions contribute less than $2\ell$ symbols. The claim
follows.

We call a substring new if no instance of it started in the previous portion of $w$. Namely,
$w_t \ldots w_{t+\ell-1}$ is new if there is no $p < t$ such that $w_t \ldots w_{t+\ell-1} = w_p \ldots w_{p+\ell-1}$. Consider a
compressed substring $w_t \ldots w_{t+k-1}$ of length $k \le \ell$. The substrings of length greater than $k$
that start at $w_t$ must be new, since LZ77 finds the longest substring that appeared before.
Furthermore, every substring that contains such a new substring is also new. That is, every
substring $w_{t'} \ldots w_{t'+k'}$ where $t' \le t$ and $k' \ge k + (t - t')$, is new.

Map each position $j \in \{\ell, \ldots, n - \ell\}$ in the compressed substring $w_t \ldots w_{t+k-1}$ to the
length-$2\ell$ substring that ends at $w_{j+\ell}$. Then each position in $\{\ell, \ldots, n - \ell\}$ that appears in
a compressed substring of length at most $\ell$ is mapped to a distinct length-$2\ell$ substring, as
desired. ■ (Claim 6)

Establishing Equation (4). We prove Equation (4) by induction on $\ell$. Claim 6 with $\ell$ set
to 1 gives the base case, i.e., $n_1 \le 2(m + 1)$. For the induction step, assume the induction
hypothesis for every $j \in [\ell - 1]$. To prove it for $\ell$, add the equation in Claim 6 to the sum of
the induction hypothesis inequalities (Equation (4)) for every $j \in [\ell - 1]$. The left hand side
of the resulting inequality is

$$\sum_{k=1}^{\ell} k \cdot n_k + \sum_{j=1}^{\ell-1} \sum_{k=1}^{j} n_k = \sum_{k=1}^{\ell} k \cdot n_k + \sum_{k=1}^{\ell-1} \sum_{j=1}^{\ell-k} n_k = \sum_{k=1}^{\ell} k \cdot n_k + \sum_{k=1}^{\ell-1} (\ell - k) \cdot n_k = \ell \cdot \sum_{k=1}^{\ell} n_k.$$

The right hand side, divided by the factor $2(m + 1)$, which is common to all inequalities, is

$$\ell + \sum_{j=1}^{\ell-1} \sum_{k=1}^{j} \frac{1}{k} = \ell + \sum_{k=1}^{\ell-1} \sum_{j=1}^{\ell-k} \frac{1}{k} = \ell + \sum_{k=1}^{\ell-1} \frac{\ell - k}{k} = \ell + \ell \cdot \sum_{k=1}^{\ell-1} \frac{1}{k} - (\ell - 1) = \ell \cdot \sum_{k=1}^{\ell} \frac{1}{k}.$$

Dividing both sides by $\ell$ gives the inequality in Equation (4). ■ (Lemma 5)



4.2 An Algorithm for LZ77 

This subsection describes an algorithm for approximating the compressibility of an input
string with respect to LZ77, which uses an approximation algorithm for COLORS as a
subroutine. The main tool in the reduction from LZ77 to COLORS consists of structural lemmas 4
and 5, summarized in the following corollary.

Corollary 7 For any $\ell_0 \ge 1$, let $m = m(\ell_0) = \max_{\ell \in [\ell_0]} \frac{d_\ell(w)}{\ell}$. Then

$$m \le C_{\mathrm{LZ}}(w) \le 4 \cdot \left( m \log \ell_0 + \frac{n}{\ell_0} \right).$$

The corollary allows us to approximate $C_{\mathrm{LZ}}$ from estimates for $d_\ell$ for all $\ell \in [\ell_0]$. To obtain
these estimates, we use the algorithm of [7] for COLORS as a subroutine (in the full version [16]
we also describe a simpler COLORS algorithm with the same provable guarantees). Recall
that an algorithm for COLORS approximates the number of distinct colors in an input string,
where the $i$th character represents the $i$th color. We denote the number of colors in an input
string $\tau$ by $C_{\mathrm{COL}}(\tau)$. To approximate $d_\ell$, the number of distinct length-$\ell$ substrings in $w$,
using an algorithm for COLORS, view each length-$\ell$ substring as a separate color. Each query
of the algorithm for COLORS can be implemented by $\ell$ queries to $w$.

Let Estimate($\ell$, $B$, $\delta$) be a procedure that, given access to $w$, an index $\ell \in [n]$, an
approximation parameter $B = B(n, \ell) > 1$ and a confidence parameter $\delta \in [0, 1]$, computes
a $B$-estimate for $d_\ell$ with probability at least $1 - \delta$. It can be implemented using an
algorithm for COLORS, as described above, and employing standard amplification techniques to
boost success probability from $\frac{2}{3}$ to $1 - \delta$: running the basic algorithm $O(\log \delta^{-1})$ times and
outputting the median. Since the algorithm of [7] requires $O(n/B^2)$ queries, the query
complexity of Estimate($\ell$, $B$, $\delta$) is $O\left(\frac{n}{B^2} \, \ell \log(\delta^{-1})\right)$. Using Estimate($\ell$, $B$, $\delta$) as a subroutine,
we get the following approximation algorithm for the cost of LZ77.
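The amplification step mentioned above (repeat and take the median) is generic and can be sketched independently of the COLORS subroutine. In the sketch below, `basic_estimator` stands for any randomized procedure that is correct with probability at least 2/3, and the constant `repetitions_per_log` is a hypothetical choice for illustration, not a value taken from the paper.

```python
import itertools
import math
import statistics


def amplify(basic_estimator, delta, repetitions_per_log=18):
    """Run basic_estimator O(log(1/delta)) times and return the median of its outputs.

    If each run is within the desired bounds with probability >= 2/3, then the
    median is outside them only when at least half of the runs fail.
    """
    reps = max(1, math.ceil(repetitions_per_log * math.log(1 / delta)))
    if reps % 2 == 0:
        reps += 1  # an odd count makes the median one of the actual outputs
    return statistics.median(basic_estimator() for _ in range(reps))


# Toy stand-in for a basic estimator: correct two times out of three, wildly off otherwise.
outputs = itertools.cycle([100, 100, 10**9])
print(amplify(lambda: next(outputs), delta=0.01))
```

The median is used rather than the mean because a single wildly wrong run cannot move it, which is what makes the failure probability drop exponentially in the number of repetitions.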



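Algorithm II: An $(A, \epsilon)$-approximation for $C_{\mathrm{LZ}}(w)$

1. Set $\ell_0 = \left\lceil \frac{2}{A\epsilon} \right\rceil$ and $B = \frac{A}{2\sqrt{\log(2/(A\epsilon))}}$.

2. For all $\ell$ in $[\ell_0]$, let $\hat{d}_\ell$ = Estimate($\ell$, $B$, $\frac{1}{3\ell_0}$).

3. Combine the estimates to get an approximation of $m$ from Corollary 7: set $\hat{m} = \max_{\ell \in [\ell_0]} \frac{\hat{d}_\ell}{\ell}$.

4. Output $\hat{C}_{\mathrm{LZ}} = \hat{m} \cdot \frac{A}{B} + \epsilon n$.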



Theorem 8 Algorithm II $(A, \epsilon)$-estimates $C_{\mathrm{LZ}}(w)$. With a proper implementation that reuses
queries and an appropriate data structure, its query and time complexity are $\tilde{O}\left(\frac{n}{A^3 \epsilon}\right)$.

Proof: By the Union Bound, with probability $\ge \frac{2}{3}$, all values $\hat{d}_\ell$ computed by the algorithm
are $B$-estimates for the corresponding $d_\ell$. When this holds, $\hat{m}$ is a $B$-estimate for $m$ from
Corollary 7, which implies that

$$\frac{\hat{m}}{B} \le C_{\mathrm{LZ}}(w) \le 4 \cdot \left( \hat{m} B \log \ell_0 + \frac{n}{\ell_0} \right).$$

Equivalently, $\frac{C_{\mathrm{LZ}} - 4(n/\ell_0)}{4 B \log \ell_0} \le \hat{m} \le B \cdot C_{\mathrm{LZ}}$. Multiplying all three terms by $\frac{A}{B}$ and adding $\epsilon n$
to them, and then substituting parameter settings for $\ell_0$ and $B$, specified in the algorithm,
shows that $\hat{C}_{\mathrm{LZ}}$ is indeed an $(A, \epsilon)$-estimate for $C_{\mathrm{LZ}}$.

As explained before the algorithm statement, each call to Estimate($\ell$, $B$, $\frac{1}{3\ell_0}$) costs
$O\left(\frac{n}{B^2} \, \ell \log \ell_0\right)$ queries. Since the subroutine is called for all $\ell \in [\ell_0]$, the straightforward
implementation of the algorithm would result in $O\left(\frac{n}{B^2} \, \ell_0^2 \log \ell_0\right)$ queries. Our analysis of the
algorithm, however, does not rely on independence of queries used in different calls to the
subroutine, since we employ the Union Bound to calculate the error probability. It will still
apply if we first run Estimate to approximate $d_{\ell_0}$ and then reuse its queries for the
remaining calls to the subroutine, as though it requested to query only the length-$\ell$ prefixes of
the length-$\ell_0$ substrings queried in the first call. With this implementation, the query
complexity is $O\left(\frac{n}{B^2} \, \ell_0 \log \ell_0\right) = O\left(\frac{n}{A^3 \epsilon} \log^2 \frac{1}{A\epsilon}\right)$. To get the same running time, one can maintain
counters for all $\ell \in [\ell_0]$ for the number of distinct length-$\ell$ substrings seen so far and use a
trie to keep the information about the queried substrings. Every time a new node at some
depth $\ell$ is added to the trie, the $\ell$th counter is incremented. ■
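The bookkeeping described at the end of the proof (one trie, one counter per depth) can be sketched as follows. This covers only the data structure, not the COLORS estimator that turns the counts into estimates $\hat{d}_\ell$; the class and function names are illustrative, and a nested dictionary is used as the trie.

```python
class SubstringTrie:
    """Trie over queried substrings with a counter of distinct prefixes per length."""

    def __init__(self, max_len):
        self.root = {}
        self.max_len = max_len
        # counts[l] = number of distinct length-l prefixes inserted so far (index 0 unused)
        self.counts = [0] * (max_len + 1)

    def insert(self, s):
        node = self.root
        for depth, ch in enumerate(s[: self.max_len], start=1):
            if ch not in node:
                node[ch] = {}
                self.counts[depth] += 1  # a new node at this depth is a new distinct prefix
            node = node[ch]


def count_sampled_prefixes(w, positions, l0):
    """Read the length-l0 substring at each sampled position and count distinct prefixes."""
    trie = SubstringTrie(l0)
    for p in positions:
        if p + l0 <= len(w):
            trie.insert(w[p:p + l0])
    return trie.counts[1:]


print(count_sampled_prefixes("abcabdxbcabc", [0, 3, 6, 9], 3))
```

Each sampled substring is read once and updates all $\ell_0$ counters in a single pass, which is how the queries made for length $\ell_0$ are reused for every shorter length.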

4.3 Lower Bounds: Reducing Colors to LZ77 

We have demonstrated that estimating the LZ77 compressibility of a string reduces to COLORS.
As shown in [17], COLORS is quite hard, and it is not possible to improve much on the
simple approximation algorithm in [7], on which we base the LZ77 approximation algorithm
in the previous subsection. A natural question is whether there is a better algorithm for the
LZ77 estimation problem. That is, is the LZ77 estimation strictly easier than COLORS? As
we shall see, it is not much easier in general.

Lemma 9 (Reduction from COLORS to LZ77) Suppose there exists an algorithm $A_{\mathrm{LZ}}$
that, given access to a string $w$ of length $n$ over an alphabet $\Sigma$, performs $q = q(n, |\Sigma|, \alpha, \beta)$
queries and with probability at least 5/6 distinguishes between the case that $C_{\mathrm{LZ}}(w) \le \alpha n$
and the case that $C_{\mathrm{LZ}}(w) \ge \beta n$, for some $\alpha < \beta$.

Then there is an algorithm for COLORS taking inputs of length $n' = \Theta(\alpha n)$ that performs
$q$ queries and, with probability at least 2/3, distinguishes inputs with at most $\alpha' n'$ colors from
those with at least $\beta' n'$ colors, where $\alpha' = \alpha/2$ and $\beta' = \beta \cdot 2 \cdot \max\left\{1, \frac{4\log n'}{\log|\Sigma|}\right\}$.

Two notes are in place regarding the reduction. The first is that the gap between the
parameters $\alpha'$ and $\beta'$ that is required by the COLORS algorithm obtained in Lemma 9
is larger than the gap between the parameters $\alpha$ and $\beta$ for which the LZ-compressibility
algorithm works, by a factor of $4 \cdot \max\left\{1, \frac{4\log n'}{\log|\Sigma|}\right\}$. In particular, for binary strings
$\frac{\beta'}{\alpha'} = O\left(\log n' \cdot \frac{\beta}{\alpha}\right)$, while if the alphabet is large, say, of size at least $n'$, then
$\frac{\beta'}{\alpha'} = O\left(\frac{\beta}{\alpha}\right)$. In general, the gap increases by a factor of at most $O(\log n')$. The second note
is that the number of queries, $q$, is a function of the parameters of the LZ-compressibility
problem and, in particular, of the length of the input strings, $n$. Hence, when writing $q$ as a
function of the parameters of COLORS and, in particular, as a function of $n' = \Theta(\alpha n)$, the
complexity may be somewhat larger. It is an open question whether a reduction without such
an increase is possible.
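
As a worked illustration of the parameter translation discussed above, the following Python sketch computes the COLORS parameters from the LZ parameters. It assumes the reading $\alpha' = \alpha/2$ and $\beta' = 2\beta \cdot \max\{1, 4\log n'/\log|\Sigma|\}$ of Lemma 9; the function names `colors_parameters` and `gap_blowup` are hypothetical and serve only this example.

```python
import math


def colors_parameters(alpha, beta, n_prime, sigma_size):
    """Map LZ parameters (alpha, beta) to COLORS parameters (alpha', beta').

    Assumes alpha' = alpha / 2 and
    beta' = 2 * beta * max(1, 4 log n' / log |Sigma|).
    The ratio of logarithms does not depend on the base.
    """
    factor = max(1.0, 4 * math.log2(n_prime) / math.log2(sigma_size))
    return alpha / 2, 2 * beta * factor


def gap_blowup(n_prime, sigma_size):
    """Factor by which beta'/alpha' exceeds beta/alpha under the same assumption."""
    return 4 * max(1.0, 4 * math.log2(n_prime) / math.log2(sigma_size))
```

For binary strings with $n' = 1024$ the blow-up is $4 \cdot 40 = 160$, which grows like $\log n'$, while for an alphabet of size $n'$ it is the constant 16.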

Prior to proving the lemma, we discuss its implications. [17] give a strong lower bound
on the sample complexity of approximation algorithms for COLORS. An interesting special
case is that a subpolynomial-factor approximation for COLORS requires many queries even
with a promise that the strings are only slightly compressible: for any $B = n^{o(1)}$, distinguishing
inputs with $n/11$ colors from those with $n/B$ colors requires $n^{1-o(1)}$ queries. Lemma 9
extends that bound to estimating LZ compressibility: for any $B = n^{o(1)}$ and any alphabet
$\Sigma$, distinguishing strings with LZ compression cost $\Omega(n)$ from strings with cost $O(n/B)$
requires $n^{1-o(1)}$ queries.

The lower bound for COLORS in [17] applies to a broad range of parameters, and yields
the following general statement when combined with Lemma 9:

Corollary 10 (LZ is Hard to Approximate with Few Samples) For sufficiently
large $n$, all alphabets $\Sigma$ and all $B \le n^{1/4}/(4\log n^{3/2})$, there exist $\alpha, \beta \in (0,1)$ such that
every algorithm that distinguishes between the case that $C_{\mathrm{LZ}}(w) \le \alpha n$ and the case that
$C_{\mathrm{LZ}}(w) \ge \beta n$ for $w \in \Sigma^n$ must perform

$$\Omega\left(\left(\frac{n}{B'}\right)^{1-2/k}\right)\ \text{queries for}\ B' = \Theta\left(B \cdot \max\left\{1, \frac{4\log n}{\log|\Sigma|}\right\}\right)\ \text{and}\ k = \Theta\left(\sqrt{\frac{\log n}{\log B' + \frac{1}{2}\log\log n}}\right).$$

Proof of Lemma 9. Suppose we have an algorithm $A_{\mathrm{LZ}}$ for LZ-compressibility as specified
in the premise of Lemma 9. Here we show how to transform a COLORS instance $\tau$ into an
input for $A_{\mathrm{LZ}}$, and use the output of $A_{\mathrm{LZ}}$ to distinguish $\tau$ with at most $\alpha' n'$ colors from $\tau$
with at least $\beta' n'$ colors, where $\alpha'$ and $\beta'$ are as specified in the lemma. We shall assume
that $\beta' n'$ is bounded below by some sufficiently large constant. Recall that in the reduction
from LZ77 to COLORS, we transformed substrings into colors. Here we perform the reverse
operation.

Given a COLORS instance $\tau$ of length $n'$, we transform it into a string $w$ of length $n = n' \cdot k$
over $\Sigma$, where $k = \lceil 1/\alpha' \rceil$. We then run $A_{\mathrm{LZ}}$ on $w$ to obtain information about $\tau$. We begin
by replacing each color in $\tau$ with a uniformly selected substring in $\Sigma^k$. The string $w$ is the
concatenation of the corresponding substrings (which we call blocks). We show that:

1. If $\tau$ has at most $\alpha' n'$ colors, then $C_{\mathrm{LZ}}(w) \le 2\alpha' n$;

2. If $\tau$ has at least $\beta' n'$ colors, then $\Pr_w\left[C_{\mathrm{LZ}}(w) \ge \frac{\beta'}{2} \cdot \min\left\{1, \frac{\log|\Sigma|}{4\log n'}\right\} \cdot n\right] \ge \frac{7}{8}$.

That is, in the first case we get an input $w$ for $A_{\mathrm{LZ}}$ such that $C_{\mathrm{LZ}}(w) \le \alpha n$ for $\alpha = 2\alpha'$,
and in the second case, with probability at least 7/8, $C_{\mathrm{LZ}}(w) \ge \beta n$ for
$\beta = \frac{\beta'}{2} \cdot \min\left\{1, \frac{\log|\Sigma|}{4\log n'}\right\}$. Recall that the gap between $\alpha'$ and $\beta'$ is assumed to be sufficiently
large so that $\alpha < \beta$. To distinguish the case that $C_{\mathrm{COL}}(\tau) \le \alpha' n'$ from the case that
$C_{\mathrm{COL}}(\tau) \ge \beta' n'$, we can run $A_{\mathrm{LZ}}$ on $w$ and output its answer. Taking into account the failure
probability of $A_{\mathrm{LZ}}$ and the failure probability in Item 2 above (in total at most $1/6 + 1/8 < 1/3$),
the lemma follows.

We prove these two claims momentarily, but first observe that in order to run the algorithm
$A_{\mathrm{LZ}}$, there is no need to generate the whole string $w$. Rather, upon each query of
$A_{\mathrm{LZ}}$ to $w$, if the index of the query belongs to a block that has already been generated, the
answer to $A_{\mathrm{LZ}}$ is determined. Otherwise, we query the element (color) in $\tau$ that corresponds
to the block. If this color was not yet observed, then we set the block to a uniformly selected
substring in $\Sigma^k$. If this color was already observed in $\tau$, then we set the block according to
the substring that was already selected for the color. In either case, the query to $w$ can now
be answered. Thus, each query to $w$ is answered by performing at most one query to $\tau$.

It remains to prove the two items concerning the relation between the number of colors
in $\tau$ and $C_{\mathrm{LZ}}(w)$. If $\tau$ has at most $\alpha' n'$ colors then $w$ contains at most $\alpha' n'$ distinct blocks.
Since each block is of length $k$, at most $k$ compressed segments start in each new block. By
definition of LZ77, at most one compressed segment starts in each repeated block. Hence,

$$C_{\mathrm{LZ}}(w) \le \alpha' n' \cdot k + (1 - \alpha') n' \le \alpha' n + n' \le 2\alpha' n.$$

If $\tau$ contains $\beta' n'$ or more colors, $w$ is generated using at least $\beta' n' \cdot \log(|\Sigma|^k) = \beta' n \log|\Sigma|$
random bits. Hence, with high probability (e.g., at least 7/8) over the choice of these random
bits, any lossless compression algorithm (and in particular LZ77) must use at least
$\beta' n \log|\Sigma| - 3$ bits to compress $w$. Each symbol of the compressed version of $w$ can be
represented by $\max\{\lceil \log|\Sigma| \rceil, 2\lceil \log n \rceil\} + 1$ bits, since it is either an alphabet symbol or
a pointer-length pair. Since $n = n' \lceil 1/\alpha' \rceil$, and $\alpha' > 1/n'$, each symbol takes at most
$\max\{4\log n', \log|\Sigma|\} + 2$ bits to represent. This means the number of symbols in the
compressed version of $w$ is

$$C_{\mathrm{LZ}}(w) \ge \frac{\beta' n \log|\Sigma| - 3}{\max\{4\log n', \log|\Sigma|\} + 2} \ge \frac{1}{2} \cdot \beta' n \cdot \min\left\{1, \frac{\log|\Sigma|}{4\log n'}\right\},$$

where we have used the fact that $\beta' n'$, and hence $\beta' n$, is at least some sufficiently large
constant. ■
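
The lazy simulation of $w$ used in the proof (answering each query to $w$ with at most one query to $\tau$) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: the class name `LazyBlockString`, the callback `tau_query`, and the use of Python's `random` module as the source of uniform blocks are all hypothetical choices made for this example.

```python
import math
import random


class LazyBlockString:
    """Simulates query access to w, built block by block from a COLORS instance.

    tau_query(j) returns the color at position j of the COLORS instance.
    Each color is mapped to one uniformly random block of length
    k = ceil(1 / alpha_prime) over the given alphabet, generated on first use.
    """

    def __init__(self, tau_query, n_prime, alpha_prime, alphabet, rng=None):
        self.tau_query = tau_query
        self.k = math.ceil(1 / alpha_prime)
        self.n = n_prime * self.k          # length of the simulated string w
        self.alphabet = alphabet
        self.rng = rng or random.Random()
        self.color_at = {}                 # positions of tau already queried
        self.block_of_color = {}           # color -> its random block
        self.tau_queries = 0               # number of queries made to tau

    def query(self, i):
        """Return w[i], querying tau only if the enclosing block is unknown."""
        pos, offset = divmod(i, self.k)
        if pos not in self.color_at:
            self.color_at[pos] = self.tau_query(pos)
            self.tau_queries += 1
        color = self.color_at[pos]
        if color not in self.block_of_color:
            # First time this color is seen: draw a fresh uniform block.
            self.block_of_color[color] = [
                self.rng.choice(self.alphabet) for _ in range(self.k)
            ]
        return self.block_of_color[color][offset]
```

An algorithm for LZ compressibility can be run against `query` directly; the counter `tau_queries` then never exceeds the number of queries made to $w$, and positions of $\tau$ with equal colors yield identical blocks.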
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